-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] [Enhancement] Fix nullptr and support Iceberg null padding #49212
Conversation
@@ -274,6 +275,14 @@ Status GroupReader::_create_column_reader(const GroupReaderParam::Column& column | |||
RETURN_IF_ERROR(ColumnReader::create(_column_reader_opts, schema_node, column.slot_type(), | |||
column.t_iceberg_schema_field, &column_reader)); | |||
} | |||
if (column_reader == nullptr) { | |||
if (column.t_iceberg_schema_field == nullptr) { | |||
return Status::InternalError("Invalid file: No valid column reader."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I combined the bugfix and enhancement. Lmk any comments on the null padding behavior, otherwise can split pr and at least just add return error status so doesnt segfault at e.g. column_reader->set_need_parse_levels
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Samrose-Ahmed maybe it will be better that split the bugfix and enhancement, as i know, we only materialize column that has at least one subfiled in parquet reader, for a padding column, it's padded at scanner level. cc @Smith-Cruise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes its for struct evolution in Iceberg e.g. you had event: struct<action: string>
and you evolved schema to event: struct<action: string, id:string>
, according to Iceberg spec, if you select event.id
you need to pad the old files that don't have event.id
column with null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for this case, we want to query event.id and the former files don't contains this subfield, i actually think we have covered this case, and treat the column as column that not need materialized in the reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I debugged and it crashes on one of our Iceberg tables. I added a unit test to replicate.
can you put crash log here? |
The crash is at this line when
Log:
|
9245031
to
76ca9a0
Compare
76ca9a0
to
143b6aa
Compare
143b6aa
to
169f130
Compare
@Samrose-Ahmed i think the root cause is in
|
You can take a look at |
That makes sense actually I'll update the PR |
169f130
to
7526bad
Compare
I've updated the PR. |
cbb02fd
to
a7d3b29
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, for clang-format, could you run ./clang-format.sh
in /path/to/starrocks/code/build-support
@@ -274,6 +274,10 @@ Status GroupReader::_create_column_reader(const GroupReaderParam::Column& column | |||
RETURN_IF_ERROR(ColumnReader::create(_column_reader_opts, schema_node, column.slot_type(), | |||
column.t_iceberg_schema_field, &column_reader)); | |||
} | |||
if (column_reader == nullptr) { | |||
// this shouldn't happen but guard | |||
return Status::InternalError("No valid column reader."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good job
43eb6fe
to
2ec8151
Compare
This approach doesn't seem to completely work. I updated it to use field ids but now the new test (TestStructEvolutionPadNull) fails at the same original issue (now returns 'No valid column reader', would crash before w/o check). The suggested change is fine to handle robustness for type check but it doesn't seem to handle the root issue, which is you need to pad the nested subfield with default value, that doesn't happen with this code (like comment mentions It seems like we do need the change I made earlier? thoughts @Smith-Cruise @zombee0 |
2ec8151
to
0248595
Compare
PR StarRocks#48151 introduced a regression, where what would previously return an Error now caused a nullptr dereference and crashed the entire CN. This change fixes to handle the case and return nulls. Signed-off-by: Samrose Ahmed <[email protected]>
0248595
to
ede4edb
Compare
branch 3.2 you need backport either, plz check it. |
[FE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[BE Incremental Coverage Report]✅ pass : 27 / 28 (96.43%) file detail
|
@Mergifyio backport branch-3.3 |
@Mergifyio backport branch-3.2 |
✅ Backports have been created
|
✅ Backports have been created
|
…49212) Signed-off-by: Samrose Ahmed <[email protected]> (cherry picked from commit e224d5d)
…49212) Signed-off-by: Samrose Ahmed <[email protected]> (cherry picked from commit e224d5d)
…backport #49212) (#49525) Co-authored-by: Samrose <[email protected]>
…backport #49212) (#49524) Co-authored-by: Samrose <[email protected]>
PR #48151 introduced a regression, where what would previously return an Error now caused a nullptr dereference and crashed the entire CN. This change removes the crashing and, for Iceberg, also adds support for padding evolved fields with null values, as per Iceberg spec.
Why I'm doing:
What I'm doing:
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: